Fine-Grained Micro-Tasks for MapReduce Skew-Handling

Authors

  • Josh Rosen
  • Bill Zhao
Abstract

Recent work on MapReduce has considered the problems of skew, where a job’s tasks exhibit large variance in size and processing cost, and stragglers, tasks that run slowly due to conditions on particular nodes. In this paper, we discuss an extremely simple approach to mitigating skew and stragglers: break the workload into many small tasks that are dynamically scheduled at runtime. This approach is only effective in systems with high-throughput, low-latency task schedulers and efficient data materialization, so we propose techniques for scaling these components. To demonstrate the efficacy of this technique, we compare micro-tasks to other skew handling techniques using the Spark cluster computing framework.
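The idea in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation and not Spark code: the function names (`makespan`, `split_into_tasks`), the `per_task_overhead` parameter, and the workload numbers are all hypothetical. It models dynamic scheduling as idle workers pulling the next task from a shared queue, and compares a few coarse tasks against many micro-tasks on a skewed workload.

```python
import heapq


def makespan(task_costs, num_workers, per_task_overhead=0.0):
    """Simulate dynamic scheduling and return the job completion time.

    Each worker pulls the next task from a shared queue as soon as it
    becomes idle. per_task_overhead is a hypothetical fixed scheduling
    cost added to every task.
    """
    # Min-heap of the times at which each worker becomes free.
    free_times = [0.0] * num_workers
    heapq.heapify(free_times)
    for cost in task_costs:
        free_at = heapq.heappop(free_times)  # earliest idle worker
        heapq.heappush(free_times, free_at + cost + per_task_overhead)
    return max(free_times)


def split_into_tasks(record_costs, num_tasks):
    """Split records into num_tasks contiguous chunks; return each chunk's total cost."""
    n = len(record_costs)
    bounds = [i * n // num_tasks for i in range(num_tasks + 1)]
    return [sum(record_costs[bounds[i]:bounds[i + 1]]) for i in range(num_tasks)]


# Made-up skewed workload: the first 100 records are 10x as expensive.
record_costs = [10.0] * 100 + [1.0] * 900
num_workers = 4

coarse = makespan(split_into_tasks(record_costs, 4), num_workers)
micro = makespan(split_into_tasks(record_costs, 200), num_workers)
print("coarse tasks:", coarse)
print("micro-tasks: ", micro)
```

With these made-up costs, the four coarse tasks finish at 1150 time units because one task holds all the expensive records, while 200 micro-tasks finish at 475, which is the total work divided evenly across four workers. Setting `per_task_overhead` above zero shows why the abstract stresses a low-latency scheduler: the fixed cost is paid once per task, so it grows with the number of micro-tasks.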


Similar resources

Handling Data Skew in MapReduce

MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is ...


A Survey on Partitioning Skew Diminishing Techniques in Hadoop MapReduce Environment

In the era of Big Data, it creates large size of structured and unstructured data. MapReduce is an effective tool for parallel data processing. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data assigned to each task. This causes some tasks to take much longer to finish than others and can significantly impact performance. Parallel data p...


Handling Skew in Multiway Joins in Parallel Processing

Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In...


FMEM: A Fine-grained Memory Estimator for MapReduce Jobs

MapReduce is designed as a simple and scalable framework for big data processing. Due to the lack of resource usage models, its implementation Hadoop hands over resource planning and optimizing works to users. But users also find difficulty in specifying right resource-related, especially memory-related, configurations without good knowledge of job’s memory usage. Modeling memory usage is chall...


Handling Data Skew in MapReduce Cluster by Using Partition Tuning

The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data pro...




Publication date: 2012